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Abstract 

The Kaczmarz method for solving linear systems of equations is an 
iterative algorithm that has found many applications ranging from com- 
puter tomography to digital signal processing. Despite the popularity 
of this method, useful theoretical estimates for its rate of convergence 
are still scarce. We introduce a randomized version of the Kaczmarz 
method for consistent, overdetermined linear systems and we prove that 
it converges with expected exponential rate. Furthermore, this is the 
first solver whose rate does not depend on the number of equations in 
the system. The solver does not even need to know the whole system, 
but only a small random part of it. It thus outperforms all previously 
known methods on general extremely overdetermined systems. Even 
for moderately overdetermined systems, numerical simulations as well 
as theoretical analysis reveal that our algorithm can converge faster than 
the celebrated conjugate gradient algorithm. Furthermore, our theory 
and numerical simulations confirm a prediction of Feichtinger et al. in 
the context of reconstructing bandlimited functions from nonuniform 
sampling. 

*T.S. was supported by NSF DMS grant 0511461. R.V. was supported by the Alfred P. 
Sloan Foundation and by NSF DMS grant 0401032. 
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1 Introduction and state of the art 



We study a consistent linear system of equations 

Ax = b, (1) 

where A is a full rank mxn matrix with m > n, and b e C m . One of the most 
popular solvers for such overdetermined systems is Kaczmarz's method [28], 
which is a form of alternating projection method. This method is also known 
under the name Algebraic Reconstruction Technique (ART) in computer to- 
mography [211 En], and in fact, it was implemented in the very first medical 
scanner [27]. It can also be considered as a special case of the POCS (Projec- 
tion onto Convex Sets) method, which is a prominent tool in signal and image 
processing [3U H] . 

We denote the rows of A by a*, ... , a* m and let b = (&i, . . . , b m ) T . The 
classical scheme of Kaczmarz's method sweeps through the rows of A in a 
cyclic manner, projecting in each substep the last iterate orthogonally onto 
the solution hyperplane of (ai,x) = hi and taking this as the next iterate. 
Given some initial approximation xq, the algorithm takes the form 

bi (aj, Xk) / (-) \ 

1 1 a i 1 1 2 

where i = k mod m + 1 and || • || denotes the Euclidean norm in C n . Note 
that we refer to one projection as one iteration, thus one sweep in (T5]) through 
all m rows of A consists of m iterations. 

While conditions for convergence of this method are readily established, 
useful theoretical estimates of the rate of convergence of the Kaczmarz method 
(or more generally of the alternating projection method for linear subspaces) 
are difficult to obtain, at least for m > 2. Known estimates for the rate of 
convergence are based on quantities of the matrix A that are hard to compute 
and difficult to compare with convergence estimates of other iterative methods 
(see e.g. [7J El EE] and the references therein). 

What numerical analysts would like to have is estimates of the convergence 
rate in terms of a condition number of A. No such estimates have been known 
prior to this work. The difficulty stems from the fact that the rate of con- 
vergence of (T5]) depends strongly on the order of the equations in ([T|), while 
condition numbers do not depend on the order of the rows of a matrix. 

It has been observed several times in the literature that using the rows of 
A in Kaczmarz's method in random order, rather than in their given order, 
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can greatly improve the rate of convergence, see e.g. [301 ffl ES]- While this 
randomized Kaczmarz method is thus quite appealing for applications, no 
guarantees of its rate of convergence have been known. 

In this paper, we propose the first randomized Kaczmarz method with 
exponential expected rate of convergence, cf. Section [2j Furthermore, this rate 
depends only on the scaled condition number of A and not on the number of 
equations m in the system. The solver does not even need to know the whole 
system, but only a small random part of it. Thus our solver outperforms all 
previously known methods on general extremely overdetermined systems. 

We analyze the optimality of the proposed algorithm as well as of the de- 
rived estimate, cf. Section [3J Section @] contains various numerical simulations. 
In one set of experiments we apply the randomized Kaczmarz method to the 
reconstruction of bandlimited functions from non-uniformly spaced samples. 
In another set of numerical simulations, accompanied by theoretical analysis, 
we demonstrate that even for moderately overdetermined systems, the ran- 
domized Kaczmarz method can outperform the celebrated conjugate gradient 
algorithm. 

Condition numbers. For a matrix A, its spectral norm is denoted by 
|| A || 2 , and its Frobenius norm by ||^4||f- Thus the spectral norm is the largest 
singular value of A, and the Frobenius norm is the square root of the sum of 
the squares of all singular values of A. 

The left inverse of A (which we always assume to exist) is denoted by A^ 1 . 
Thus 1 1 ^4 1 1 1 2 is the smallest constant M such that the inequality ||Ar||2 > 
-p||x||2 holds for all vectors x. 

The usual condition number of A is 

k(A) := pHap- 1 ^. 
A related version is the scaled condition number introduced by Demmel [6]: 

k(A) := waWfWa-%. 

One easily checks that 

1 < < k(A). (3) 



Estimates on the condition numbers of some typical (i.e. random or Toeplitz- 
type ) matrices are known from a big body of literature, see [H El [9], [101 ttH 
1321 1331 I3T1 13"%] and the references therein. 
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2 Randomized Kaczmarz algorithm and its rate 
of convergence 

It has been observed in numerical simulations [301 IH [26] that the convergence 
rate of the Kaczmarz method can be significantly improved when the algo- 
rithm ([2]) sweeps through the rows of A in a random manner, rather than 
sequentially in the given order. In fact, the improvement in convergence can 
be quite dramatic. Here we propose a specific version of this randomized Kacz- 
marz method, which chooses each row of A with probability proportional to its 
relevance - more precisely, proportional to the square of its Euclidean norm. 
This method of sampling from a matrix was proposed in [T4] in the context of 
computing a low-rank approximation of A, see also [31] for subsequent work 
and references. Our algorithm thus takes the following form: 

Algorithm 1 (Random Kaczmarz algorithm). Let Ax = b be a linear sys- 
tem of equations as in ([I]) and let Xq be arbitrary initial approximation to the 
solution of ([T]). For k = 0, 1, . . . compute 

6 r(i) - (a r{i) ,x k ) 
Xk+i — Xk H 1, [jo a r (i), l 4 J 

IK(i)ll2 

where r(i) is chosen from the set {1,2, ... ,m} at random, with probability 
proportional to ||a r (j)|||. 

Our main result states that x k converges exponentially fast to the solution 
of (0Q), and the rate of convergence depends only on the scaled condition number 
k{A). 

Theorem 2. Let x be the solution of ([I]). Then Algorithm^ converges to x 
in expectation, with the average error 

E||x fe - x\\ 2 2 < (l - K(Ay 2 ) k ■ \\x Q - x\\\. (5) 
Proof. There holds 

1 1 2 

\A~ 



EK^'>| 2 ^^FiT2 for all ^GC". (6) 



j=1 " 112 



Using the fact that \\A\\p = YlT=i IWjWl we can wr he (EJ) as 

m II II 2 2 

ES ( Z ^) ^r 2 H 2 for all ^C". (7) 



• =1 ir-\\F 
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The main point of the proof is to view the left hand side in ([7]) as an expectation 
of some random variable. Namely, recall that the solution space of the j-th 
equation of is the hyperplane {y : (y,cij) = bj}, whose normal is p-qj^- 
Define a random vector Z whose values are the normals to all the equations of 
(JTJ), with probabilities as in our algorithm: 

CL /j ill II ^ 7 1 1 2 / \ 

7 ' with probability .. ..» , j = 1, . . . , m. (8) 



ll%ll2 ■ \\M¥ 

Then (JTJ) says that 

E\(z,Z)\ 2 > k{A)- 2 \\z\\ 2 2 for all 2 GC n . (9) 

The orthogonal projection P onto the solution space of a random equation of 
(JTJ) is given by Pz = z — (z — x, Z) Z . 

Now we are ready to analyze our algorithm. We want to show that the error 
— reduces at each step in average (conditioned on the previous steps) 
by at least the factor of (1 — n(A)~ 2 ). The next approximation Xk is computed 
from Xk-\ as Xk = PkXk-i, where P\, P 2 , . . . are independent realizations of 
the random projection P. The vector Xk-i — Xk is in the kernel of Pk- It is 
orthogonal to the solution space of the equation onto which P^ projects, which 
contains the vector x^ — x (recall that x is the solution to all equations). The 
orthogonality of these two vectors then yields 

II l|2 || ||2 II ||2 

H^Jfc X \\2 = \\ x k-l X \\2 H^fc— 1 ^fe||2- 

To complete the proof, we have to bound \\xk-i — XkW 2 . from below. By the 
definition of Xk, we have 

\\ x k-i ~ x fc||2 = {xk-i — x, Z k ) 
where Z\, Z 2 , ■ ■ ■ are independent realizations of the random vector Z. Thus 



\x k -x\\l < (l 



x k -i - x 
\xk-i — x U' 



\Xk— 1 ^l^ - 



Now we take the expectation of both sides conditional upon the choice of the 
random vectors Z\, . . . , Zk-i (hence we fix the choice of the random projections 
Pi, ... , Pk-i and thus the random vectors x±, . . . , Xk-i, and we average over 
the random vector Zk). Then 



E{z 1 ,...,z fe _ l} ||x fc - x\\l < (l - E{z ll ...,z h _ x } 



Xk-l - x 

\%k-i — xlU 
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By (P) and the independence, 

E{Zi,...,z fc _i}||zfc - s||a < (l - K(Ay 2 ) \\x h -i - x\\ 2 2 . 
Taking the full expectation of both sides, we conclude that 
E||x fe - x\\l < (l - k(AY 2 ) E||x fc _i - x\\\. 
By induction, we complete the proof. ■ 



2.1 Quadratic time 

Theorem [2] yields a simple bound on the expected computational complexity 
of the randomized Kaczmarz Algorithm [T] to compute the solution within error 
e, i.e. 

E||xfc — xWl < e 2 \\xo — x\\\. (10) 
The expected number of iterations (projections) k E to achieve an accuracy e is 

Efc £ < n 21 ° e L~ « 2 K (A) 2 logi, (11) 
log(l — e 

where /(n) ~ g(n) means f(n)/g(n) — > 1 as ri — > oo. (Note that k(A) 2 > n 
by so the approximation in fTTTj) becomes tight as the number of equation 
n grows). 

Since each projection can be computed in 0(n) time, and k(A) 2 = 0(n) 
by ((2D) the algorithm takes 0(n 2 ) operations to converge to the solution. This 
should be compared to the Gaussian elimination, which takes 0(mn 2 ) time. 
Strassen's algorithm and its improvements reduce the exponent in Gaussian 
elimination, but these algorithms £1X6 , clS of now, of no practical use. 

Of course, we have to know the (approximate) Euclidean lengths of the 
rows of A before we start iterating; computing them takes 0(nm) time. But 
the lengths of the rows may in many cases be known a priori. For example, all 
of them may be equal to one (as is the case for Vandermonde matrices arising in 
trigonometric approximation) or tightly concentrated around a constant value 
(as is the case for random matrices). 

The number m of the equations is essentially irrelevant for our algorithm. 
The algorithm does not even need to know the whole matrix, but only 0(n) 
random rows. Such Monte-Carlo methods have been successfully developed 
for many problems, even with precisely the same model of selecting a random 
submatrix of A (proportional to the squares of the lengths of the rows), see 
[T%] for the original discovery and [21] for subsequent work and references. 
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3 Optimality 



We discuss conditions under which our algorithm is optimal in a certain sense, 
as well as the optimality of the estimate on the expected rate of convergence. 

3.1 General lower estimate 

For any system of linear equations, our estimate can not be improved beyond 
a constant factor, as shown by the following theorem. 

Theorem 3. Consider the linear system of equations ([T]) and let x be its 
solution. Then there exists an initial approximation xq such that 

- x\\ 2 2 > (1 - 2k/n(A) 2 ) ■ \\x - x\\ 2 2 (12) 

for all k = 1,2,... 

Proof. For this proof we can assume without loss of generality that the 
system (CO) is homogeneous: Ax = 0. Let Xq be a vector which realizes k(A), 
that is k(A) = ||y4|| F ||y4 _1 xo||2 and Hxolh = 1- As in the proof of Theorem [2], 
we define the random normal Z associated with the rows of A by (jHJ). Similar 
to (jHJ), we have E|(xo, Z)\ 2 = k(A)~ 2 . We thus see span(x ) as an "exceptional" 
direction, so we shall decompose 1R™ = span(xo) © (^o)~S writing every vector 
x E R n as 

x = x' ■ Xq + x" , where x' E K, x" E (x ) ± . 

In particular, 

E\Z'\ 2 = k(A)- 2 . (13) 

We shall first analyze the effect of one random projection in our algorithm. 
To this end, let x E M n , ||x||2 < 1, and let z E M n , ||^||2 = 1. (Later, x will be 
the running approximation Xk-i, and z will be the random normal Z). The 
projection of x onto the hyperplane whose normal is z equals 

x\ = x — (x, z)z. 

Since 

(x,z) = x'z' + {x'\z") } (14) 

we have 

\x\-x'\ = \(x,z)z'\ < \x'\\z'\ 2 + \(x",z")z'\ < \z'\ 2 + \(x",z")z'\ (15) 
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because \x'\ < ||x|| 2 < 1. Next, 

IIt-'MI 2 — ll-r"ll 2 — Hi-" — It ~/\?"\\ 2 — \\t-"\\ 2 

1 1 tXj ^|| || 1 1 1 1 «•*-' ^ £f J As 1 1 ^ || <-^' 1 1 2 

= — 2(x, z)(x" , z") + (x, 2;) 2 ||2;"||2 < —2(x, z)(x" , 2") + (x, z) 2 

because H^"^ < || 2: || 2 = 1- Using (IT3|) . we decompose (x,z) as a + b, where 
a = x'z' and 6 = (x", z") and use the identity — 2 (a + 6)6 + (a + 6) 2 = a 2 — b 2 
to conclude that 

IKII2 " Ik"ll2 < I^IVI 2 - (*V> 2 < l*T - (x",z") 2 (16) 

because |x'| < ||x||2 < 1. 

Now we apply f fT5l) and ( {TBI to the running approximation x = Xk-i and 
the next approximation x\ = Xk obtained with a random z = Z k . Denoting 
p k = (x'l, Z'l), we have by (JTSJ) that \x' k - x' k ^\ < \Z' k \ 2 + \p k Z' k \ and by (USD 
that ||a;j£||| — < \Z'k\ 2 ~ \Pk\ 2 - Since x' = 1 and Xq = 0, we have 

k - 11 < e 14 - < E i^;i 2 + E m;i ( i7 ) 



and 



It" II 2 

I x fcll2 



E(KHi-K-il©^El^l'-Eln 



Since ||x^||| > 0, we conclude that Y^=i \Pj\ 2 — Sj=i \^'j\ 2 - Using this, we 
apply Cauchy-Schwartz inequality in ffTTj) to obtain 



\l/2/_-E \ 1/2 



i4-ii<Ei4i 2 +(Ei^i 2 ) (Ei^r) =2Ei^i 2 - 



j=i j=i j=i j=i 



Since all are copies of the random vector Z, we conclude by ( TT3|) that 
E|x' fe - 1| < 2A;E|Z'| 2 < 2k/K(A) 2 . Thus E||x fc || > E|x' fc | > 1 - 2k/K(A) 2 . This 
proves the theorem, actually with the stronger conclusion 

E||x fc - x|| 2 > (1 - 2k/n(A) 2 ) ■ ||x - x|| 2 . 

The actual conclusion follows by Jensen's inequality. ■ 
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3.2 The upper estimate is attained 

If k(A) = 1 (equivalently, if k(A) = \Znby (j3J)), then the estimate in Theorem[2] 
becomes an equality. This follows directly from the proof of Theorem [2j 

Furthermore, there exist arbitrarily large systems and with arbitrarily large 
condition numbers k(A) for which the estimate in Theorem [2] becomes an 
equality. Indeed, let n and m > n be arbitrary numbers. Let also k > y/n 
be any number such that itl/k 2 is an integer. Then there exists a system ([1]) 
of m equations in n variables and with k{A) = k, for which the estimate in 
Theorem [2] becomes an equality for every k. 

To see this, we define the matrix A with the help of any orthogonal set 
ei, . . . , e n in M n . Let the first mj k 2 rows of A be equal to ex, the other rows of 
A be equal to one of the vectors ej, j > 1, so that every vector from this set 
repeats at least m/n 2 times as a row (this is possible because k 2 > n). Then 
k(A) = k (note that (jSJ) is attained for z — ei). 

Let us test our algorithm on the system Ax = with the initial approxi- 
mation xq = ei to the solution x = 0. Every step of the algorithm brings the 
running approximation to with probability k~ 2 (the probability of picking 
the row of A equal to e\ in uniform sampling), and leaves the running approx- 
imation unchanged with probability 1 — ti~ 2 . By the independence, for all k 
we have 

E||xfc — xo ||1 = (l — K ~ 2 ) ' ll^o _ 

4 Numerical experiments and comparisons 

4.1 Reconstruction of bandlimited signals from nonuni- 
form sampling 

The reconstruction of a bandlimited function / from its nonuniformly spaced 
sampling values {/(ifc)} is a classical problem in Fourier analysis, with a wide 
range of applications [2]. We refer to (T2J [13] for various efficient numerical 
algorithms. Staying with the topic of this paper, we focus on the Kaczmarz 
method, also known as POCS (Projection Onto Convex Sets) method in signal 
processing [40J. 

As efficient finite-dimensional model, appropriate for a numerical treat- 
ment of the nonuniform sampling problem, we consider trigonometric poly- 
nomials [20J. In this model the problem can be formulated as follows: Let 



9 



fit) = J2 r l= _ r xie 2ndt , where x = {xi} r l= _ r G C 2r+1 . Assume we are given the 
nonuniformly spaced nodes {tk}™ =1 and the sampling values {/(£fc)}fcLi- Our 
goal is to recover / (or equivalently x). 

The solution space for the j-the equation is given by the hyperplane 

{y:{y,D r {.-t j )) = f{t j )}, 

where D r denotes the Dirichlet kernel D r (t) = X)I=-r e 2mht . Feichtinger and 
Grochenig argued convincingly (see e.g. [12]) that instead of D r (- — tj) one 
should consider the weighted Dirichlet kernels */WjD r (- — tj), where the weight 
w j = tj+1 2 tj ~ 1 , j = l,...,m. The weights are supposed to compensate for 
varying density in the sampling set. 

Formulating the resulting conditions in the Fourier domain, we arrive at 
the linear system of equations [20] 

Ax = 6, where A jjk = y/w~e 27riktl , bj = y fw~f(t j ), (18) 

with j — 1, . . . , to; k — — r, . . . , r. Let use denote n := 2r + 1 then A is an 
m x n matrix. 

Applying the standard Kaczmarz method (the POCS method as proposed 
in [ID]) to (fl8l) means that we sweep through the projections in the natural 
order, i.e., we first project on the hyperplane associated with the first row of 
A, then proceed to the second row, the third row, etc. As noted in [T2j this is a 
rather inefficient way of implementing the Kaczmarz method in the context of 
the nonuniform sampling problem. It was suggested in [5] that the convergence 
can be improved by sweeping through the rows of A in a random manner, but 
no proof of the (expected) rate of convergence was given. [1] also proposed 
another variation of the Kaczmarz method in which one projects in each step 
onto that hyperplane that provides the largest decrease of the residual error. 
This strategy of maximal correction turned out to provide very good conver- 
gence, but was found to be impractical due to the enormous computational 
overhead, since in each step all m projections have to be computed in order 
to be able to select the best hyperplane to project on. It was also observed 
in [1] that this maximal correction strategy tends to select the hyperplanes as- 
sociated with large weights more frequently than hyperplanes associated with 
small weights. 

Equipped with the theory developed in Section [2] we can shed light on the 
observations mentioned in the previous paragraph. Note that the j-th row of 
A in ffTSl) has squared norm equal to nwj . Thus our Algorithm [T] chooses the 
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j-th row of A with probability Wj. Hence Algorithm [T] can be interpreted as 
a probabilistic, computationally very efficient implementation of the maximal 
correction method suggested in [3]. 

Moreover, we can give a bound on the expected rate of convergence of the 
algorithm. Theorem [2] states that this rate depends only on the scaled condi- 
tion number k(A), which is bounded by k(A) \fn by ([3]). The condition number 
k(A) for the trigonometric system (|18|) has been estimated by Grochenig [T9] . 
For instance we have the following 

Theorem 4 (Grochenig). If the distance of every sampling point tj to its 
neighbor on the unit torus is at most 5 < then k(A) < rzff- • In particular, 
if$<i: then k{A) < 3. 

Furthermore we note that our algorithm can be straightforward applied to 
the approximation of multivariate trigonometric polynomials. We refer to [1] 
for condition number estimates for this case. 

In our numerical simulation, we let r = 50, m = 700 and generate the 
sampling points tj by drawing them randomly from a uniform distribution 
in [0, 1] and ordering them by magnitude. We apply the standard Kaczmarz 
method, the randomized Kaczmarz method, where the rows of A are selected 
at random with equal probability (labeled as simple randomized Kaczmarz in 
Figured]), and the randomized Kaczmarz method of Algorithm [1] (where the 
rows of A are selected at random with probability proportional to the 2-norm 
of the rows). We plot the least squares error ||x — Xk\\2 versus the number of 
projections, cf. Figure [U Clearly, Algorithm [1] significantly outperforms the 
other Kaczmarz methods, demonstrating not only the power of choosing the 
projections at random, but also the importance of choosing the projections 
according to their relevance. 



4.2 Comparison to conjugate gradient algorithm 

In recent years conjugate gradient (CG) type methods have emerged as the 
leading iterative algorithms for solving large linear systems of equations, since 
they often exhibit remarkably fast convergence [TTl [2Tj . How does Algorithm [1] 
compare to the CG algorithms? 

The rate of convergence of CGLS applied to Ax = b is bounded by [17] 

/k(A)-l\k 

iFfc ~ A\a*a < 2||x - g IU*-A^ w^ T l ) ' ( 19 ) 
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Figure 1: Comparison of rate of convergence for the randomized Kaczmarz 
method described in Algorithm [I] and other Kaczmarz methods applied to the 
nonuniform sampling problem described in the main text. 



It is known that the CG method may converge faster when the singular 
values of A are clustered [SB]. For instance, take a matrix whose singular 
values, all but one, are equal to one, while the remaining singular value is very 
small, say 10 -8 . While this matrix is far from being well-conditioned, CGLS 
will nevertheless converge in only two iterations, due to the clustering of the 
spectrum of A, cf. [36]. In comparison, the proposed Kaczmarz method will 
converge extremely slowly in this example by Theorem [31 since k(A) ~ 10 8 . 

On the other hand, Algorithm [T] can outperform CGLS on problems for 
which CGLS is actually quite well suited, in particular for random Gaussian 
matrices A, as we show below. 

Solving random linear systems Let Abeamxn matrix whose entries are 
independent N(0, 1) random variables. Condition numbers of such matrices 

1 Note that since we either need to apply CGLS to Ax — b or CG to A* Ax = A*b we 
indeed have to use k(A) — y/k(A*A) here and not y/k(A). The asterisk * denotes complex 
transpose here. 




A* A '■ — 



^{Ay,Ay). 
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are well studied, when the aspect ratio y := n/m < 1 is fixed and the size n of 
the matrix grows to infinity. Then the following almost sure convergence was 
proved by Geman [16] and Silvestein [35] respectively: 



\\A\U 14 /- 1/P- I li2 , /- 

7=- 1 + vs/; — 7= ► 1 - viz- 



Hence 



k(A) -> 1 + (20) 



Also, since Mi£ — > 1 we have 



(21) 



V™ 1 - v 7 ^' 

For estimates that hold for each finite n rather than in the limit, see e.g. pj] 
and [H]. 

Now we compare the expected computation complexities of the random- 
ized Kaczmarz algorithm proposed in Algorithm [1] and CGLS to compute the 
solution within error e for the system (pQ) with a random Gaussian matrix A. 

We estimate the expected number of iterations (projections) k £ for Algo- 
rithm [1] to achieve an accuracy e in (lllj) . Using bound (12T1) . we have 

2n , 1 

(1 - jy) 2 e 

as n —>■ oo. Since each iteration (projection) requires n operations, the total 
expected number of operations is 

2n 2 1 

Complexity of randomized Kaczmarz « — log -. (22) 

(i - Vv) £ 

The expected number of iterations k' £ for CGLS to achieve the accuracy e 
can be estimated using ( fl9|) . First note that the norm || • \\a*a is on average 
proportional to the Euclidean norm ||z||2- Indeed, for any fixed vector z one 
has Ell^H^,^ = E||Az||| = mll^Hg. Thus, when using CGLS for a random 
matrix A, we can expect that the bound ( fl9l) on the convergence also holds 
for the Euclidean norm. 
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Consequently, the expected number of iterations k' e in CGLS to compute 
the solution within accuracy e as in (fIU|) is 

Efc e « , where if(A) = + 

logtf(A) v ' k(A)-l 

By (1201) . for random matrices A of growing size we have K — > 1/y/y almost 
surely. Thus 

21og^ 

EA; e « V- 

log 1 

The main computational task in each iteration of CGLS consists of two matrix 
vector multiplications, one with A and one with A*, each requiring m x n = 
n 2 /y operations. Hence the total expected number of operations is 

An 2 2 

Complexity of CGLS « T • log -. (23) 

V log - e 



It is easy to compare the complexities ( 1221) and (1231) as functions of 
since n 2 and log(l/e) are common terms in both (using the approximation 
log(2/e) ~ log(l/e) for small e), cf. also Figure [2j A simple computation 
shows that (1221) and (1231 are essentially equal when y « |. Hence for Gaussian 
matrices our analysis predicts that Algorithm [T] outperforms CGLS in terms 
of computational efficiency when m > 3n. 

While the computational cost of Algorithm [1] decreases as — decreases, this 
is not the case for CGLS. Therefore it is natural to ask for the optimal ratio 
— for CGLS for Gaussian matrices that minimizes its overall computational 
complexity. It is easy to see that for given e the expression in (T231) is minimized 
if y = 1/e, where e is Euler's number. Thus if we are given an m x n Gaussian 
matrix (with m > en), the most efficient strategy to employ CGLS is to first 
select a random submatrix A^ of size en x n from A (and the corresponding 
subvector of b) and apply CGLS to the subsystem A^ e 'x = b^ e \ This will 
result in the optimal computational complexity 4en 2 log- for CGLS. 

Thus for a fair comparison between the randomized Kaczmarz method and 
CGLS, we will apply CGLS in our numerical simulations to both the "full" 
system Ax = b as well as to a subsystem A^x = b^ e \ where A^ is an en x n 
submatrix of A, randomly selected from A. 

In the first simulation we let A be of dimension 300 x 100, the entries 
of x are also drawn from a normal distribution. We apply both, CGLS and 
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Figure 2: Comparison of the computational complexities (1221) (randomized 
Kaczmarz method) and (T23]) (conjugate gradient algorithm) as functions of 
the ratio y — — (the common factors n 2 and log(l/e) in f[2"2"j) and ff2"3"j) are 
ignored in the two curves). 



Algorithm [H We apply CGLS to the full system of size 300 x 100 as well as to 
a randomly selected subsystem of size 272 x 100 (representing the optimal size 
en x n, computed above). Since we know the true solution we can compute 
the actual least squares error \\x — Xk\\ after each iteration. Each method is 
terminated after reaching the required accuracy e = 10~ 14 . We repeat the 
experiment 100 times and for each method average the resulting least squares 
errors. 

In Figure [3] we plot the averaged least squares error (y-axis) versus the 
number of floating point operations (x-axis), cf. Figure [3 We also plot the 
estimated convergence rate for both methods. Recall that our estimates pre- 
dict essentially identical bounds on the convergence behavior for CGLS and 
Algorithm [1] for the chosen parameters (m = 3n). Since in this example the 
performance of CGLS applied to the full system of size 300 x 100 is almost 
identical to that of CGLS applied to the subsystem of size 272 x 100, we display 
only the results of CGLS applied to the original system. 

While CGLS performs somewhat better than the (upper) bound predicts, 
Algorithm [1] shows a significantly faster convergence rate. In fact, the random- 



15 



ized Kaczmarz method is almost twice as efficient as CGLS in this example. 

In the second example we let m = 500, n = 100. In the same way as 
before, we illustrate the convergence behavior of CGLS and Algorithm [TJ In 
this example we display the convergence rate for CGLS applied to the full 
system (labeled as CGLS full matrix) of size 500 x 100 as well as to a random 
subsystem of size 272 x 100 (labeled as CGLS submatrix). As is clearly vis- 
ible in Figure H] CGLS applied to the subsystem provides better performance 
than CGLS applied to the full system, confirming our theoretical analysis. Yet 
again, Algorithm [1] is even more efficient than predicted, this time outper- 
forming CGLS by a factor of 3 (instead of a factor of about 2 according to our 
theoretical analysis). 



Randomized Kaczmarz 
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Figure 3: Comparison of rate of convergence for the randomized Kaczmarz 
method described in Algorithm [I] and the conjugate gradient least squares 
algorithm for a system of equations with a Gaussian matrix of size 300 x 100. 

Remark: An important feature of the conjugate gradient algorithm is that 
its computational complexity reduces significantly when the complexity of the 
matrix- vector multiplication is much smaller than 0(mn), as is the case e.g. 
for Toeplitz-type matrices. In such cases conjugate gradient algorithms will 
outperform Kaczmarz type methods. 
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Figure 4: Comparison of rate of convergence for the randomized Kaczmarz 
method described in Algorithm [1] and the conjugate gradient least squares 
algorithm for a system of equations with a Gaussian matrix of size 500 x 100. 



5 Some open problems 

In this final section we briefly discuss a few loose ends and some open problems. 

Kaczmarz method with relaxation: It has been observed that the conver- 
gence of the Kaczmarz method can be accelerated by introducing relaxation. 
In this case the iteration rule becomes 

II ®i II 2 

where the A^, i — 1, . . . , m are relaxation parameters. For consistent systems 
the relaxation parameters must satisfy [25] 

< liminffc^ooA^ < limsup fc _, 00 Afc ) i < 2 (25) 

to ensure convergence. 

We have observed in our numerical simulations that for instance for Gaus- 
sian matrices a good choice for the relaxation parameter is to set A^ := A = 
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1 + — for all k and i. While we do not have a proof for an improvement of per- 
formance or even optimality, we provide the result of a numerical simulation 
that is typical for the behavior we have observed, cf. Figure |5j 




Floating point operations (units in millions) 



Figure 5: Comparison of rate of convergence for the randomized Kaczmarz 
method with and without relaxation parameter. We have used A = 1 + — as 
relaxation parameter. 



Inconsistent systems: Many systems arising in practice are inconsistent due 
to noise that contaminates the right hand side. In this case it has been shown 
that convergence to the least squares solution can be obtained with (strong 
under) relaxation [5, 22j. We refer to [221 E3] for suggestions for the choice of 
the relaxation parameter as well as further in-depth analysis for this case. 

While our theoretical analysis presented in this paper assumes consistency 
of the system of equations, it seems quite plausible that the randomized Kacz- 
marz method combined with appropriate underrelaxation should also be useful 
for inconsistent systems. 
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